Red wine is one of the most beautiful drinks, so it’s going to be interesting to find out what makes a good wine! :)
The data can be downloaded from this link; you can also find it on my GitHub along with the other report resources: link.
Also read this text file, which describes the variables and how the data was collected.
The data-set contains 11 chemical characteristics plus a quality score from 0 to 10, given by at least 3 wine experts, for 1599 different wines!
wine <- read.csv('./data/wineQualityReds.csv')
The data has 1599 observations of 13 variables: a row index (X), the 11 chemical characteristics and the quality score.
str(wine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Input variables (based on physicochemical tests):
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
Output variable (based on sensory data):
12. quality (score between 0 and 10)
A closer look at the single-variable plots.
table(wine$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
82.5% of the wines have a quality of either 5 or 6.
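The 82.5% figure can be checked directly from the counts printed above. A minimal sketch, with the counts typed in by hand so it runs without the CSV file (`quality_counts` and `share_5_or_6` are just illustrative names):

```r
# Counts per quality level, copied from the table(wine$quality) output above
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Share of each quality level, in percent
quality_share <- 100 * quality_counts / sum(quality_counts)

# Combined share of the wines rated 5 or 6
share_5_or_6 <- sum(quality_share[c("5", "6")])
print(round(share_5_or_6, 1))
```

With the data loaded, `prop.table(table(wine$quality))` should give the same shares as fractions.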
Let’s zoom in on the correlation between quality and the chemical characteristics:
| variable | Pearson corr |
|---|---|
| fixed.acidity | 0.12 |
| volatile.acidity | -0.39 |
| citric.acid | 0.23 |
| residual.sugar | 0.01 |
| chlorides | -0.13 |
| free.sulfur.dioxide | -0.05 |
| total.sulfur.dioxide | -0.19 |
| density | -0.17 |
| pH | -0.06 |
| sulphates | 0.25 |
| alcohol | 0.48 |
As we can see, the only moderately strong correlation is with the alcohol percentage (0.48); the next strongest, volatile acidity, is only -0.39.
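A table like the one above can be produced with a few lines of base R. This is only a sketch: `cor_with_outcome` is a hypothetical helper name, and the `toy` data frame is made up for illustration so that the block runs without the CSV file.

```r
# Pearson correlation of every predictor with the outcome column.
# `outcome` names the quality column; `drop` lists columns to ignore
# (for the wine data that would be the row index "X").
cor_with_outcome <- function(df, outcome = "quality", drop = "X") {
  predictors <- setdiff(names(df), c(outcome, drop))
  sapply(predictors, function(v) cor(df[[v]], df[[outcome]], method = "pearson"))
}

# Toy data, made up for illustration only (NOT taken from the wine data)
toy <- data.frame(
  X       = 1:6,
  alcohol = c(9.0, 9.5, 10.0, 10.5, 11.0, 12.0),
  acidity = c(0.9, 0.7, 0.8, 0.5, 0.4, 0.3),
  quality = c(4, 5, 5, 6, 6, 7)
)

toy_cor <- cor_with_outcome(toy)
print(round(toy_cor, 2))
```

On the real data the call would be `cor_with_outcome(wine)`, assuming `wine` has been read in as shown earlier.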
The scatter plots below, between quality and each of the characteristics, confirm the correlation values.
Now it’s time to ask our question:
Which chemical characteristics influence the quality, or is there any relation between them at all?
Logic says yes, but the correlations and graphs say the opposite.
Let’s think in a different way: instead of searching for a direct relation between each characteristic and quality, let’s find the relations among the chemical characteristics themselves.
The correlation table below is a good way to find these relations.
The correlations are computed using both the Pearson and Spearman methods, so each element in the table is structured as: Pearson’s / Spearman’s.
Correlations greater than 0.3 or less than -0.3 are colored red.
| – | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 1 | ||||||||||
| volatile.acidity | -0.26 / -0.28 | 1 | |||||||||
| citric.acid | 0.67 / 0.66 | -0.55 / -0.61 | 1 | ||||||||
| residual.sugar | 0.11 / 0.22 | 0 / 0.03 | 0.14 / 0.18 | 1 | |||||||
| chlorides | 0.09 / 0.25 | 0.06 / 0.16 | 0.2 / 0.11 | 0.06 / 0.21 | 1 | ||||||
| free.sulfur.dioxide | -0.15 / -0.18 | -0.01 / 0.02 | -0.06 / -0.08 | 0.19 / 0.07 | 0.01 / 0 | 1 | |||||
| total.sulfur.dioxide | -0.11 / -0.09 | 0.08 / 0.09 | 0.04 / 0.01 | 0.2 / 0.15 | 0.05 / 0.13 | 0.67 / 0.79 | 1 | ||||
| density | 0.67 / 0.62 | 0.02 / 0.03 | 0.36 / 0.35 | 0.36 / 0.42 | 0.2 / 0.41 | -0.02 / -0.04 | 0.07 / 0.13 | 1 | |||
| pH | -0.68 / -0.71 | 0.23 / 0.23 | -0.54 / -0.55 | -0.09 / -0.09 | -0.27 / -0.23 | 0.07 / 0.12 | -0.07 / -0.01 | -0.34 / -0.31 | 1 | ||
| sulphates | 0.18 / 0.21 | -0.26 / -0.33 | 0.31 / 0.33 | 0.01 / 0.04 | 0.37 / 0.02 | 0.05 / 0.05 | 0.04 / 0 | 0.15 / 0.16 | -0.2 / -0.08 | 1 | |
| alcohol | -0.06 / -0.07 | -0.2 / -0.22 | 0.11 / 0.1 | 0.04 / 0.12 | -0.22 / -0.28 | -0.07 / -0.08 | -0.21 / -0.26 | -0.5 / -0.46 | 0.21 / 0.18 | 0.09 / 0.21 | 1 |
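A lower-triangular "Pearson / Spearman" table of this kind can be sketched in base R as follows. `cor_pair_table` is a hypothetical helper, not the code used for this report, and the `toy` data frame is made up so that the block runs on its own.

```r
# Build a lower-triangular table whose cells read "Pearson / Spearman".
cor_pair_table <- function(df, digits = 2) {
  pearson  <- round(cor(df, method = "pearson"),  digits)
  spearman <- round(cor(df, method = "spearman"), digits)
  # paste() works element-wise in column-major order, so the matrix shape is kept
  cells <- matrix(paste(pearson, "/", spearman),
                  nrow = nrow(pearson), dimnames = dimnames(pearson))
  cells[upper.tri(cells)] <- ""  # keep only the lower triangle
  diag(cells) <- "1"
  as.data.frame(cells, stringsAsFactors = FALSE)
}

# Toy data, made up for illustration only
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 10),
                  c = c(5, 3, 4, 1, 2))

pair_tab <- cor_pair_table(toy)
print(pair_tab)

# Pairs where either coefficient exceeds 0.3 in absolute value
strong <- abs(cor(toy, method = "pearson")) > 0.3 |
          abs(cor(toy, method = "spearman")) > 0.3
```

For the wine data, the call would be something like `cor_pair_table(wine[, 2:12])`, assuming columns 2 to 12 hold the 11 chemical characteristics as in the `str()` output.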
Fixed acidity is correlated with citric acid, density and pH.
Volatile acidity is correlated with citric acid and sulphates.
Citric acid is correlated with volatile acidity, fixed acidity, pH and sulphates.
Chlorides is correlated with density and sulphates.
Density is correlated with fixed acidity, alcohol, residual sugar and chlorides.
pH is correlated with fixed acidity and citric acid.
Sulphates is correlated with volatile acidity, citric acid and chlorides.
Residual sugar is correlated with density.
Alcohol is correlated with density.
So we have 7 parent nodes that have children:
Quality, Alcohol, Density, Fixed Acidity, Chlorides, Citric acid and Volatile acidity.
All of them depend on each other: as we know, alcohol affects quality, alcohol is affected by density, which is affected by other chemicals, which are affected in turn by others, and so on.
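Counting negative and positive correlations, the quality value increases when the following happen:
[Figure: tree of the relations described above. Nodes in the order they appear: volatile acidity, pH, sulphates, citric acid, pH, sulphates, fixed acidity, residual sugar, chlorides, density, alcohol, quality. pH and sulphates appear twice in the tree.]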
Let’s go back to our question: WHAT CHEMICAL PROPERTIES INFLUENCE THE QUALITY?
To answer that, we must go through the earlier tree from the bottom to the top.
The plots below show this: the first plot has Quality as Y (the dependent variable), then the next variable in the tree becomes the Y of the next plot, and so on.
Now that we have shown a relation between quality and the chemical properties, let’s build a regression model, so that in the future, given the chemical properties of some wine, we can predict its quality.
Linear regression uses one or more independent variables to predict the outcome of a dependent variable.
We will use the formula Y ~ X, where X represents the relations shown above in the tree.
Because the variables are on different scales, it would be nicer if all of them were brought to the same scale. I’ll choose the scale from 0 to 10, so every element in each variable will have a value from 0 to 10. As long as the rescaling is linear, the ordering of the values and the correlations between the variables stay unchanged.
The rescaled data is stored in a new variable called ‘wine.ratio’.
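The exact rescaling code is not shown in this report, so the following is only a sketch of one common way to do it, min-max scaling to the range 0 to 10. The helper names `rescale_0_10` and `rescale_columns` are assumptions, and so is the choice to leave `X` and `quality` untouched. The sample values are the first ten rows of `alcohol`, `pH` and `quality` from the `str()` output above.

```r
# Linearly rescale a numeric vector to the range [0, 10] (min-max scaling).
# Note: a constant column would divide by zero here.
rescale_0_10 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  10 * (x - rng[1]) / (rng[2] - rng[1])
}

# Apply the rescaling to every column except the ones listed in `keep`.
rescale_columns <- function(df, keep = c("X", "quality")) {
  cols <- setdiff(names(df), keep)
  df[cols] <- lapply(df[cols], rescale_0_10)
  df
}

# First ten rows of three columns, copied from the str(wine) output
wine_head <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

wine_head_ratio <- rescale_columns(wine_head)
print(range(wine_head_ratio$alcohol))

# The correlation between two columns is the same before and after rescaling
print(cor(wine_head$alcohol, wine_head$pH))
print(cor(wine_head_ratio$alcohol, wine_head_ratio$pH))
```

Under these assumptions, `wine.ratio <- rescale_columns(wine)` would produce a data frame of the kind used in the model below.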
Now let’s look at the model:
reg_lm <- lm( quality ~
alcohol * density +
density * fixed.acidity +
density * residual.sugar +
density * chlorides +
chlorides * sulphates +
fixed.acidity * pH +
fixed.acidity * citric.acid +
citric.acid * pH +
citric.acid * volatile.acidity +
citric.acid * sulphates+
volatile.acidity * sulphates
,data = wine.ratio )
# print some information about the model (mtable() comes from the memisc package)
library(memisc)
mtable( reg_lm, sdigits = 3 )
##
## Calls:
## reg_lm: lm(formula = quality ~ alcohol * density + density * fixed.acidity +
## density * residual.sugar + density * chlorides + chlorides *
## sulphates + fixed.acidity * pH + fixed.acidity * citric.acid +
## citric.acid * pH + citric.acid * volatile.acidity + citric.acid *
## sulphates + volatile.acidity * sulphates, data = wine.ratio)
##
## =============================================
## (Intercept) 5.230***
## (0.359)
## alcohol 0.205***
## (0.040)
## density -0.012
## (0.066)
## fixed.acidity -0.013
## (0.091)
## residual.sugar -0.081
## (0.061)
## chlorides 0.133
## (0.117)
## sulphates 0.217**
## (0.069)
## pH -0.028
## (0.042)
## citric.acid 0.110
## (0.085)
## volatile.acidity -0.202***
## (0.038)
## alcohol x density -0.003
## (0.007)
## density x fixed.acidity -0.006
## (0.009)
## density x residual.sugar 0.015
## (0.009)
## density x chlorides -0.020
## (0.022)
## chlorides x sulphates -0.036*
## (0.014)
## fixed.acidity x pH 0.031*
## (0.013)
## fixed.acidity x citric.acid -0.006
## (0.010)
## pH x citric.acid -0.036**
## (0.013)
## citric.acid x volatile.acidity 0.015
## (0.009)
## sulphates x citric.acid -0.001
## (0.012)
## sulphates x volatile.acidity -0.004
## (0.019)
## ---------------------------------------------
## R-squared 0.362
## adj. R-squared 0.354
## sigma 0.649
## F 44.847
## p 0.000
## Log-likelihood -1566.812
## Deviance 664.474
## AIC 3177.623
## BIC 3295.920
## N 1599
## =============================================
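The summary lists 9 main effects and 11 interaction rows even though the formula only contains 11 `*` pairs. That is because, in an R formula, `a * b` is shorthand for `a + b + a:b`, and repeated main effects are merged. A small sketch that expands the formula from this report with `terms()` (no data needed; `model_formula` is just an illustrative name):

```r
# The model formula, copied from the lm() call above
model_formula <- quality ~
  alcohol * density +
  density * fixed.acidity +
  density * residual.sugar +
  density * chlorides +
  chlorides * sulphates +
  fixed.acidity * pH +
  fixed.acidity * citric.acid +
  citric.acid * pH +
  citric.acid * volatile.acidity +
  citric.acid * sulphates +
  volatile.acidity * sulphates

expanded <- terms(model_formula)

# Every term after expansion: main effects first, then the a:b interactions
term_labels <- attr(expanded, "term.labels")
print(term_labels)

# Order 1 = main effect, order 2 = two-way interaction
term_order <- attr(expanded, "order")
print(table(term_order))
```

So the model estimates 21 coefficients in total: the intercept, 9 main effects and 11 two-way interactions.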
I chose three plots to summarize the analysis we did:
The first one is a visualization of the regression model.
I’ll draw box plots for the formula Y ~ X, with Y (wine quality) as a factor on the x-axis and X (the relations between the chemical properties shown above) on the y-axis.
I’ll use the new data-set, wine.ratio, here.
As shown above, the mean of X gets higher as quality gets higher for qualities 3, 5, 6 and 7. The exceptions are 4 and 8: the mean of X at quality 4 is lower than the mean at 3, and the mean at quality 8 is lower than the mean at quality 7.
Still, we can say that as quality increases, X increases.
The second one shows the difference between the actual quality and the quality predicted by the regression model. Let’s start by making a new column in the data called “quality.predicted”, which will hold the predictions of the regression model.
wine$quality.predicted <- round( predict(reg_lm, wine.ratio ) )
Now let’s plot it:
The bars show the number of wines having a quality x.
The red ones are for the actual quality, and the blue ones are for the predicted quality.
Most of the predicted qualities are 5 and 6, with a few 7s. The model couldn’t predict the qualities 3, 4 and 8.
Instead, it predicted 5 and 6 more often than they actually occur.
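Besides the bar chart, the same comparison can be read off a cross-table of actual against predicted quality. The sketch below uses small made-up vectors so that it runs on its own; they are NOT results from the model in this report, and the names `actual`, `predicted_raw` and `confusion` are only illustrative.

```r
# Toy vectors, made up for illustration only
actual        <- c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8)
predicted_raw <- c(5.2, 5.4, 5.1, 5.3, 5.6, 5.8, 6.1, 5.4, 6.6, 6.4)

# Round the continuous predictions to whole quality levels, as done above
predicted <- round(predicted_raw)

# Cross-table with the same levels on both sides, so that quality levels
# that are never predicted still show up as empty columns
levels_q  <- 3:8
confusion <- table(actual    = factor(actual,    levels = levels_q),
                   predicted = factor(predicted, levels = levels_q))
print(confusion)

# Share of wines whose rounded prediction matches the actual quality
accuracy <- mean(actual == predicted)
print(accuracy)
```

With the real data, `table(wine$quality, wine$quality.predicted)` would give the corresponding table; empty columns for 3, 4 and 8 would match what the bar chart shows.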
The last one shows the density of each chemical property at each quality level from 3 to 8.
Sulphates has low values at quality levels 3, 4, 5 and 6, and slightly higher values at 7 and 8.
Chlorides: at quality level 8 most values are 1, and as the quality level goes down the number of rows having the value 1 decreases.
Residual sugar has low values for all quality levels.
Fixed acidity has most of its values under 3 at the low quality levels 3, 4 and 5, but at quality levels 6, 7 and 8 its values are spread out.
The other chemicals are spread across the graph for all quality levels.
We started by wondering about the relation between the quality of wine and its chemical properties. It’s clear that there must be a relation, and we did show one, but it is still weak and we can’t count on it.
So how does this make sense? If we trust that the chemical tests are right and there is no error in the data, then the error is in the human factor! Let’s not forget that the quality is entered by humans, and humans always make mistakes!
So I believe, to some degree, that many of the quality values reflect personal taste rather than actual high quality.
So how do we really get the relation between a wine’s quality and its chemical properties?
I don’t believe that diving deeper into this data set would give me the answer. To get the answer we have to find a better data set for it; maybe that data would contain prices, brands, and more accurate quality scores or drinkers’ reviews.
Also, chemical properties aren’t everything that matters in wine: there is still the type of grape used, the quality of the wine brand, any flavors added and other things that haven’t been considered in the data-set.
Another thing: the fact that most of the quality values are 5 or 6 makes it harder to analyze the data. There are no very good wines (of quality 9 or 10) or very bad wines (of quality 0, 1 or 2), which also confirms that the data isn’t strong enough to rely on and, as I said, humans make mistakes.
The data-set used in this report:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: Elsevier, Pre-press (pdf), bib.